
    Exploring Fully Offloaded GPU Stream-Aware Message Passing

    Modern heterogeneous supercomputing systems comprise CPUs, GPUs, and high-speed network interconnects. Communication libraries that support efficient data transfers involving buffers in GPU memory typically require the CPU to orchestrate the data transfer operations. A new offload-friendly communication strategy, stream-triggered (ST) communication, was explored to allow the synchronization and data movement operations to be offloaded from the CPU to the GPU. An implementation based on Message Passing Interface (MPI) one-sided active target synchronization was used as an exemplar to illustrate the proposed strategy, and a latency-sensitive nearest-neighbor microbenchmark was used to explore its performance characteristics. The offloaded implementation shows significant on-node performance advantages over standard MPI active RMA (36%) and point-to-point (61%) communication. The multi-node improvement is currently smaller (23% faster than standard active RMA but 11% slower than point-to-point), and work to pursue further improvements is in progress. Comment: 12 pages, 17 figures
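    For context, the sketch below shows the standard host-driven pattern the abstract describes: MPI active target synchronization (post/start/complete/wait), where every call is issued by the CPU. This is a minimal illustration, not the paper's implementation; the buffer names, message size, 1-D ring topology, and the assumption of at least three ranks (so the two neighbors are distinct) are all illustrative.

        /* Host-driven MPI active target RMA (PSCW) for a nearest-neighbor
         * exchange; every call below is issued by the CPU, which is the
         * orchestration the stream-triggered design offloads to the GPU. */
        #include <mpi.h>
        #include <stdlib.h>

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);

            int rank, size;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            const int n = 1024;                       /* illustrative message size */
            double *halo = calloc(n, sizeof(double)); /* window: receives neighbor data */
            double *send = malloc(n * sizeof(double));
            for (int i = 0; i < n; i++) send[i] = rank;

            /* Expose the halo buffer as an RMA window. */
            MPI_Win win;
            MPI_Win_create(halo, n * sizeof(double), sizeof(double),
                           MPI_INFO_NULL, MPI_COMM_WORLD, &win);

            /* Left and right neighbors on a 1-D ring (assumes size >= 3). */
            int nbrs[2] = { (rank - 1 + size) % size, (rank + 1) % size };
            MPI_Group world_group, nbr_group;
            MPI_Comm_group(MPI_COMM_WORLD, &world_group);
            MPI_Group_incl(world_group, 2, nbrs, &nbr_group);

            /* Active target synchronization epoch: post/start/complete/wait. */
            MPI_Win_post(nbr_group, 0, win);   /* expose window to neighbors  */
            MPI_Win_start(nbr_group, 0, win);  /* open access epoch           */
            MPI_Put(send, n, MPI_DOUBLE, nbrs[1], 0, n, MPI_DOUBLE, win);
            MPI_Win_complete(win);             /* our puts are done           */
            MPI_Win_wait(win);                 /* neighbors' puts have landed */

            MPI_Group_free(&nbr_group);
            MPI_Group_free(&world_group);
            MPI_Win_free(&win);
            free(send);
            free(halo);
            MPI_Finalize();
            return 0;
        }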

    Designing Multi-Leader-Based Allgather Algorithms for Multi-Core Clusters

    The increasing demand for computational cycles is being met through multi-core processors. A large number of cores per node necessitates multi-core-aware designs to extract the best performance. The Message Passing Interface (MPI) is the dominant parallel programming model on modern high performance computing clusters, and MPI collective operations account for a significant portion of an application's communication time. Existing optimizations for collectives exploit shared memory for intra-node communication to improve performance, but they still do not scale well as the number of cores per node increases. In this work, we propose a novel and scalable multi-leader-based hierarchical Allgather design. This design allows better cache sharing on Non-Uniform Memory Access (NUMA) machines and makes better use of the network speed available with high performance interconnects such as InfiniBand. The new multi-leader-based scheme achieves a performance improvement of up to 58% for small messages and 70% for medium-sized messages.
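    To make the hierarchy concrete, here is a minimal sketch assembled from standard MPI collectives, not the paper's implementation: on-node ranks are partitioned into groups, each group gathers onto its leader through shared memory, the leaders allgather across the network, and each leader broadcasts the result within its group. The function name, the use of MPI_INT, and the assumption that every group has the same size are all illustrative, and the resulting block order follows the (leader, group rank) layout rather than necessarily matching world rank order.

        #include <mpi.h>
        #include <stdlib.h>

        /* Two-level, multi-leader allgather sketch (not the paper's code).
         * Assumes all groups have equal size so the block layout is uniform. */
        void multi_leader_allgather(const int *sendbuf, int count, int *recvbuf,
                                    int num_leaders, MPI_Comm comm)
        {
            int rank, size;
            MPI_Comm_rank(comm, &rank);
            MPI_Comm_size(comm, &size);

            /* Ranks on the same node communicate through shared memory. */
            MPI_Comm node_comm;
            MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank,
                                MPI_INFO_NULL, &node_comm);

            int node_rank;
            MPI_Comm_rank(node_comm, &node_rank);

            /* Partition on-node ranks into num_leaders groups, one leader each. */
            MPI_Comm group_comm;
            MPI_Comm_split(node_comm, node_rank % num_leaders, node_rank,
                           &group_comm);

            int group_rank, group_size;
            MPI_Comm_rank(group_comm, &group_rank);
            MPI_Comm_size(group_comm, &group_size);

            /* All group leaders across the job form their own communicator. */
            MPI_Comm leader_comm;
            MPI_Comm_split(comm, group_rank == 0 ? 0 : MPI_UNDEFINED, rank,
                           &leader_comm);

            /* Step 1: gather each group's data onto its leader (intra-node). */
            int *staging = (group_rank == 0)
                ? malloc((size_t)count * group_size * sizeof(int)) : NULL;
            MPI_Gather(sendbuf, count, MPI_INT, staging, count, MPI_INT,
                       0, group_comm);

            /* Step 2: leaders exchange aggregated blocks over the network;
             * multiple leaders per node keep more of the link busy. */
            if (leader_comm != MPI_COMM_NULL) {
                MPI_Allgather(staging, count * group_size, MPI_INT,
                              recvbuf, count * group_size, MPI_INT, leader_comm);
                MPI_Comm_free(&leader_comm);
                free(staging);
            }

            /* Step 3: each leader broadcasts the full result within its group. */
            MPI_Bcast(recvbuf, count * size, MPI_INT, 0, group_comm);

            MPI_Comm_free(&group_comm);
            MPI_Comm_free(&node_comm);
        }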